A  Theory  of  Pattern  Rejection 

Simon  Baker  and  Shree  K.  Nayar 

Columbia University Technical Report
CUCS-013-95 

Department  of  Computer  Science 
Columbia  University 
New  York,  NY  10027 


Email: {simonb, nayar}@cs.columbia.edu
Tel.  +1  (212)  939-7000,  Fax  +1  (212)  666-0140 


Report date: May 1995. Pages: 31. Approved for public release; distribution unlimited.


Contents

1 Introduction
2 Related Work
3 Theory
3.1 The Setting
3.2 Basic Definitions
3.3 Rejection Based Classifiers
3.4 Composite Rejectors
3.5 Time Analysis
3.6 Space Analysis
4 Construction of Composite Rejectors
4.1 Notation
4.2 The Class Assumption
4.3 Relationship with the K-L Expansion
4.4 Verifying the Class Assumption
4.5 Derivation of a General Purpose Rejector
4.6 Choice of the Rejection Vector
5 Experiments
5.1 3-D Object Recognition
5.2 Local Feature Detection
6 Discussion
A Estimation of the Thresholds
B Implementation of the Derived Rejector


Abstract 


The efficiency of pattern recognition is critical when there are a large number of
classes to be discriminated, or when the recognition algorithm must be applied a large
number of times. We propose and analyze a general technique, namely pattern rejection,
that leads to great efficiency improvements in both cases. Rejectors are introduced as
algorithms that very quickly eliminate from further consideration most of the classes or
inputs (depending on the setting). Importantly, a number of rejectors may be combined to
form a composite rejector, which performs far more effectively than any of its component
rejectors. Composite rejectors are analyzed, and conditions are derived which guarantee both
efficiency and practicality. A general technique is proposed for the construction of rejectors,
based on a single assumption about the pattern classes. The generality is shown through a
close relationship with the Karhunen-Loeve expansion. Further, a comparison with Fisher’s
discriminant analysis is included to illustrate the benefits of pattern rejection. Composite
rejectors were constructed for two applications, namely, object recognition and local feature
detection. In both cases, a substantial improvement in efficiency over existing techniques
is demonstrated.


Index Terms: Pattern recognition, computational efficiency, pattern rejection, composite
rejector, object recognition, feature detection, edge detection, Fisher’s discriminant
analysis, Karhunen-Loeve expansion.

1 Introduction


We address the efficiency of pattern recognition, which is known to be vital when the
number of classes involved is large. An example application in computational vision is
object recognition, which in many cases can be reduced to a classical pattern recognition
problem [Murase and Nayar 95]. Of particular importance in this context is the growth
rate of the recognition time as a function of the number of classes (objects). High efficiency
also proves critical when the recognition algorithm must be applied a large number of times.
This is the case in local feature detection [Nalwa 93] [Nayar et al. 95], where the detector
needs to be applied at every pixel in an image.

We propose a general theory that results in substantial efficiency improvements in
both of the above scenarios. The theory is based upon the central notion of a rejector. A
rejector is an algorithm that efficiently eliminates from further consideration most of the
large number of classes (e.g. objects in recognition) or inputs (e.g. local image windows in
feature detection). While the intuitive concept of a rejector is simple, its formalization is
significant since it leads immediately to the following important observations and results
that constitute the proposed theory:

1. The definition of correctness for a rejector is much less constraining than that for a
classifier (recognizer). In particular, a rejector is only required to eliminate most of
the classes or inputs most of the time, which is substantially less demanding than
requiring perfect classification all of the time. As a result, rejectors can be constructed
that are far more efficient than corresponding classifiers.

2. Although, in general, a rejector does not provide the final solution to the pattern
recognition problem, it significantly reduces the number of possible classes or inputs
to consider. Consequently, the recognizer can dedicate its computational resources
to a much smaller number of candidates. In doing so, pattern rejection is taking
advantage of the fact that the average case complexity of the recognition problem
is generally far less than the worst case complexity. In both example applications
mentioned above, namely object recognition and feature detection, this is the case.

3. Perhaps the most crucial aspect of pattern rejection is that, since a rejector eliminates
a large number of classes (inputs), the task remaining after applying a rejector is a
smaller instance of the original recognition problem. Hence, a collection of rejectors
may be combined recursively in a directed acyclic graph structure to form a composite
rejector. At each node of the composite rejector is a simple (as opposed to composite)
rejector. Significantly, each such simple rejector may be individually designed for the
set of classes (inputs) not eliminated by the parent rejector in the graph.

Each application of the composite rejector corresponds to a path through the directed
acyclic graph. At each node in the path, the associated simple rejector is applied,
thereby eliminating more of the classes (inputs). Since each subsequent rejector
is constructed for a smaller subset of the classes (inputs), child rejectors are able
to eliminate classes (inputs) that their predecessors were not able to. Overall, the
recursive structure results in the composite rejector having much more discriminatory
power than any of its component rejectors.

4. Another very important property of composite rejectors is that it is possible to
analyze their performance in terms of the performance of their components. For instance,
we derive conditions that guarantee logarithmic time complexity of recognition, in
terms of the total number of classes involved. We also analyze the preprocessing and
space requirements of a composite rejector, in particular providing conditions that
ensure practicality.

5. We propose a simple general purpose technique for constructing the component
rejectors of a composite rejector. The technique is based on a single assumption about
the nature of the pattern classes, namely, the class assumption. The generality of the
class assumption is established by exhibiting a close relationship with the
Karhunen-Loeve (K-L) expansion [Fukunaga 90] [Oja 83]. Hence, we expect the proposed
rejection technique to be applied successfully in any application for which the K-L
expansion is beneficial.

We demonstrate the significance of pattern rejection via experiments on applications
in appearance matching based object recognition [Murase and Nayar 95] and feature
detection [Nayar et al. 95]. First, we constructed a composite rejector for a widely used
image database of 20 objects, each of which constitutes a pattern class. The appearance
of each object changes considerably as the pose of the object varies. However, the
composite rejector was able to completely, and without error, discriminate between all 20
objects. The efficiency is shown to be a substantial improvement over the technique used in
[Murase and Nayar 95], which similarly achieved perfect recognition. We also empirically
illustrate logarithmic growth in the time complexity of the composite rejector. Further,
when compared with Fisher’s discriminant analysis [Duda and Hart 73], the composite
rejector is seen to be both significantly more efficient as well as more accurate. Discriminant
analysis, even at its peak performance, has an error rate of slightly over 3%, in contrast to
the error-free performance of the composite rejector. Finally, we constructed a composite
rejector for the task of feature detection. This results in a very efficient method of
preprocessing an image to identify pixels that truly deserve the application of a full-fledged
feature detector, such as the one proposed in [Nayar et al. 95].

The  remainder  of  this  paper  is  organized  as  follows.  In  Section  2  we  discuss  the 
relationship  of  pattern  rejection  to  previous  work.  We  proceed  in  Section  3  to  introduce 
the  notions  of  a  rejector  and  of  a  composite  rejector.  We  also  analyze  the  time  and  space 
complexities of composite rejectors. In Section 4, we describe the construction of the
individual rejectors that go to form a composite rejector. Section 5 presents our experimental
results,  and  Section  6  concludes  the  paper  with  a  brief  discussion  of  this  and  future  work. 

2 Related Work


The recursive structure of the composite rejector constitutes a decision tree, or more
generally a directed acyclic graph. A complete survey of work that uses such a structure
is well beyond the scope of this paper, but a small selection¹ is [Henrichon and Fu 69]
[Payne and Meisel 77] [Weng 94]. In general, a composite rejector has a directed acyclic
graph structure, as opposed to simply a tree structure, because there are many different
orders in which classes may be rejected. Hence there may be a large number of different
possible paths leading to any one node in the composite rejector. Connections can also be
drawn between our results and the large body of work on computationally motivated
nearest neighbor classifiers [Friedman et al. 77] [Bentley 80] [Yianilos 93]. Though the problem
we address is somewhat similar, namely, efficient classification, our setting is more general.

The major novelty of our approach is the central role played by the pattern classes.
It is this which leads to the composite rejectors having a directed acyclic graph structure,
rather than a tree structure. Existing work which is concerned with complexity either
models the classes as collections of points, or studies partitions of space. Hence, our
efficiency results are in terms of the number of classes, rather than the number of sample
points, or the extent to which space is partitioned. We regard this class-centered approach
as a more natural model of the problem. Importantly, it also focuses attention on what
we believe to be the key question: What properties must the pattern classes possess for
recognition to be performed efficiently? The introduction of the class assumption is an
attempt to answer this question, and to characterize what it means for a pattern class to
have a “simple” rather than a “complex” decision boundary.

A  relationship  can  be  established  between  our  technique  for  rejector  construction  and 
Fisher’s  discriminant  analysis  [Fisher  36]  [Duda  and  Hart  73].  In  particular,  our  rejection 
vector  will  be  seen  intuitively  to  maximize  between-class  scatter,  while  keeping  within-class 
scatter fixed at a low level. The major differences between rejection theory and discriminant
analysis  are  the  following: 

1.  Discriminant  analysis  is  presented  as  a  single  level  of  processing.  On  the  other  hand,  a 
composite  rejector  has  a  hierarchical  structure,  which  leads  to  superior  performance. 
In  particular,  the  relative  performance  is  accounted  for  by  the  fact  that  child  rejectors 
are  individually  constructed  for  reduced  subsets  of  classes.  Further,  the  second  and 
subsequent  Fisher  vectors  can  be  regarded  as  suboptimal  when  compared  to  the 
rejection  vectors  of  the  children  in  the  composite  rejector.  Weng  [Weng  94]  uses  a 
similar  hierarchical  structure  which  also  takes  advantage  of  this. 

2. Whereas rejection is geared towards the computational efficiency of recognition,
discriminant analysis is concerned with representational compactness. This paper, in
part, illustrates the relationship between the two. Pattern rejection can be regarded
as an attempt to bring together ideas from the nearest neighbor literature, which is
primarily concerned with complexity issues, and the pattern recognition literature,
which is more concerned with representational issues.

3. A weakness of discriminant analysis is that there is little known about when it can
be expected to work. In contrast, for rejectors, our results provide much insight into
this issue. Central in this respect is the class assumption.

¹A brief discussion on decision trees can also be found in [Duda and Hart 73].

3  Theory 

In this section, we begin by defining both classifiers and rejectors as algorithms. The
notion of a rejection-based classifier is introduced and its efficiency is discussed. Next, the
general concept of a composite rejector is put forth and its time and space requirements
are analyzed.


3.1  The  Setting 

A pattern recognition task is always based on a finite set of measurements of an underlying
physical process. In this paper, we restrict attention to the case where the measurements
consist of real numbers. However, even if the measurements are discrete valued, it is often
both simple and desirable to convert them into reals. Hence, we assume the existence of
a classification space, $S = \mathbb{R}^d$, where the integer, $d$, is the number of measurements taken.
Elements, $x \in S$, will be referred to as measurement vectors, or for convenience, vectors.

For each pattern class that is to be recognized, we can, at least conceptually, consider
the set of measurement vectors that should ideally² be classified as belonging to that class.
So, we assume the existence of a finite collection, $W_1, W_2, \ldots, W_n \subseteq S$, of pattern classes,
or simply classes. The classes themselves are defined by the application in question and we
will therefore assume that they are given to us a priori³.

²Our model of pattern recognition is deterministic in the sense that for every measurement vector
and every class, the vector is either a member of the class, or not a member of the class. Probabilistic
(Bayesian) models, where a measurement vector is assigned a probability of being a member of a given class,
are subsumed by this model. Since probabilistic models are meaningless in isolation and without a decision
theory, we can regard the classes, $W_i$, as being defined by, say, the Bayes decision rule [Duda and Hart 73].

³There are a number of ways in which the classes can be obtained [Fukunaga 90]. One possibility is
that the classes are derived using an analytical model of the underlying physical process, such as is the case
in parametric feature detection [Nalwa 93] [Nayar et al. 95]. In many applications, however, modeling the
underlying physical process proves extremely difficult. Then, the classes are often empirically estimated
using sample measurement vectors of known classification. This procedure, which relies on some form of
interpolation between sample points, has been the most widely studied problem in pattern recognition. For
our purposes, we assume that an appropriate model of interpolation has been decided upon, which
then defines the classes $W_i$. We proceed to address efficient classification.

3.2 Basic Definitions


A  classifier  is  simply  an  algorithm  that  returns  the  class  label  (if  any)  of  the  class  in  which 
the  input  measurement  vector  lies: 

Definition 1 A classifier is an algorithm, $\phi$, that given an input, $x \in S$, returns the class
label, $i$, of the class⁴ for which $x \in W_i$. If $\forall i$, $x \notin W_i$, the classifier, $\phi$, returns nothing.

We now introduce a rejector as a generalization of a classifier. It is a generalization
in two senses: (a) a rejector returns a set of class labels rather than a single label, and
(b) although the set of labels must contain the label which a correctly functioning classifier
would return, it is also allowed to contain more:

Definition 2 A rejector is an algorithm, $\psi$, that given an input, $x \in S$, returns a set of
class labels, $\psi(x)$, such that $x \in W_i \Rightarrow i \in \psi(x)$ (or equivalently $i \notin \psi(x) \Rightarrow x \notin W_i$).

The name rejector comes from the equivalent definition: $i \notin \psi(x) \Rightarrow x \notin W_i$. That is, if
$i$ is not in the output of the rejector, we can safely eliminate⁵ the class, $W_i$, from further
consideration. On the other hand, if $i \in \psi(x)$, we cannot be sure whether $x \in W_i$ or not.
For notational convenience, we now introduce the term, rejection domain, for the set of all
$x \in S$ for which the class, $W_i$, can be rejected:

Definition 3 If $\psi$ is a rejector, and $W_i$ is a class, then the rejection domain, $R_i^\psi$, of $\psi$,
for class, $W_i$, is the set of all $x \in S$ for which $i \notin \psi(x)$.

Then, the following three important properties hold:

1. For any valid rejector, $\psi$, each class, $W_i$, and its corresponding rejection domain, $R_i^\psi$,
are disjoint ($R_i^\psi \cap W_i = \emptyset$). This follows immediately from the above definitions since:

$$x \in W_i \;\overset{\text{(Def'n 2)}}{\Longrightarrow}\; i \in \psi(x) \;\overset{\text{(Def'n 3)}}{\Longrightarrow}\; x \notin R_i^\psi \qquad (1)$$

2. Subject to the one constraint that $R_i^\psi \cap W_i = \emptyset$, we are completely free to choose
the rejection domains and still conform with the correct definition of a rejector:

$$x \in W_i \;\overset{(R_i^\psi \cap W_i = \emptyset)}{\Longrightarrow}\; x \notin R_i^\psi \;\overset{\text{(Def'n 3)}}{\Longrightarrow}\; i \in \psi(x) \qquad (2)$$

The resulting freedom to choose rejection domains with “simple” decision boundaries
is what allows rejectors to be efficient.

⁴This definition of a classifier implicitly assumes that the classes, $W_i$, are disjoint, which is often the
case. Generalization to the non-disjoint case is straightforward.

⁵Although this is phrased as the class, $W_i$, being eliminated, more generally, we can think of the pair of
input and class, $(x, W_i)$, as being rejected. Hence, depending on the setting, we can view either the input,
$x$, or the class, $W_i$, as being ruled out.
3. As argued above, the rejector, $\psi$, can be used to eliminate $W_i$ from further
consideration if and only if $i \notin \psi(x)$, and by Definition 3, that is if and only if $x \in R_i^\psi$.
Hence, to reject as many classes (inputs) as possible, we should aim to choose the
rejection domains, $R_i^\psi$, to be as large as possible. However, there is a trade-off
between maximizing $R_i^\psi$, ensuring $R_i^\psi \cap W_i = \emptyset$, and using simple decision boundaries
for efficiency.
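
As a minimal sketch of Definitions 1 to 3, consider a one-dimensional classification space with three interval classes. The intervals, the threshold of 4.0, and all function names below are illustrative assumptions rather than part of the theory. The rejector uses a single comparison, so its rejection domains have far simpler boundaries than the classes themselves.

```python
# Toy illustration of Definitions 1-3 on a 1-D classification space S = R.
# The interval classes and the threshold are illustrative assumptions.
CLASSES = {1: (0.0, 1.0), 2: (2.0, 3.0), 3: (5.0, 6.0)}  # W_i = [lo, hi]


def classifier(x):
    """Definition 1: return the label i with x in W_i, or None."""
    for label, (lo, hi) in CLASSES.items():
        if lo <= x <= hi:
            return label
    return None


def rejector(x, threshold=4.0):
    """Definition 2: return a set of labels that contains the true label.

    A single comparison splits the line; classes lying entirely on the
    other side of the threshold are rejected.
    """
    if x < threshold:
        return {1, 2}
    return {3}


def in_rejection_domain(label, x):
    """Definition 3: True if x lies in the rejection domain of class `label`."""
    return label not in rejector(x)
```

In this sketch the rejection domain of class 3 is the half-line x < 4, which is disjoint from the assumed class interval [5, 6], as property 1 requires.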

3.3  Rejection  Based  Classifiers 

Applying a rejector does not guarantee that we will always be able to uniquely classify an
input, since there may be more than one class in the output of the rejector. We deal with
this potential ambiguity by adding a verification stage:

Definition 4 A verifier for a class $W_i$ is a boolean algorithm which, given an input, $x \in S$,
returns the result, 1, if $x$ is a member of $W_i$, and 0 otherwise.

We form a rejection-based classifier, $\phi^\psi$, by first applying a rejector, $\psi$, and then
applying a verifier for each class, $W_i$, where $i \in \psi(x)$, the output of the rejector. From the
outputs of the verifiers, we can immediately classify the input, $x \in S$. The efficiency of the
rejection-based classifier is given by:

$$T_{av}(\phi^\psi) = T_{av}(\psi) + E_{x \in S}(|\psi(x)|) \cdot T_{ver} \qquad (3)$$

where $T_{av}(\phi^\psi)$ is the average run time of the rejection-based classifier, $T_{av}(\psi)$ is the average
run time of the rejector, $E_{x \in S}(|\psi(x)|)$ is the expected cardinality of the rejector output,
and $T_{ver}$ is the run time of each of the verifiers (assumed to be the same for all verifiers).
Equation (3) is derived by noting that we must always apply the rejector (which contributes
the term, $T_{av}(\psi)$) and that on average we must apply $E_{x \in S}(|\psi(x)|)$ verifiers.
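
The two-stage scheme and the cost model of equation (3) can be sketched as follows. The interval classes, the threshold rejector, and the cost constants are assumptions chosen only for illustration, and the expectation in equation (3) is replaced by an average over a list of sample inputs.

```python
# Sketch of a rejection-based classifier: one cheap rejector followed by one
# verifier per surviving label. Classes, threshold and costs are assumptions.
CLASSES = {1: (0.0, 1.0), 2: (2.0, 3.0), 3: (5.0, 6.0)}  # W_i = [lo, hi]


def rejector(x):
    """Single-comparison rejector for the three interval classes."""
    return {1, 2} if x < 4.0 else {3}


def verifier(label, x):
    """Definition 4: return 1 if x is a member of W_label, and 0 otherwise."""
    lo, hi = CLASSES[label]
    return 1 if lo <= x <= hi else 0


def rejection_based_classify(x):
    """Return (label or None, number of verifiers that were run)."""
    candidates = rejector(x)
    result = None
    for label in sorted(candidates):
        if verifier(label, x):
            result = label
    return result, len(candidates)


def average_cost(samples, t_rej=1.0, t_ver=10.0):
    """Empirical form of equation (3): T_rej + mean(|psi(x)|) * T_ver."""
    mean_cardinality = sum(len(rejector(x)) for x in samples) / len(samples)
    return t_rej + mean_cardinality * t_ver
```

With these assumed costs, running all three verifiers on every input would cost 30 time units, so any rejector whose output averages fewer than 2.9 labels already reduces the average cost.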

The reason for introducing a rejection-based classifier is that we aim to be able
to construct very efficient rejectors which are also very good at eliminating most of the
classes. Hence, both $T_{av}(\psi)$ and $E_{x \in S}(|\psi(x)|)$ will be small quantities, leading to efficient
classification. With $T_{av}(\psi)$ as the measure of the efficiency of a rejector, we now introduce
effectiveness as a measure of how well a rejector eliminates classes:

Definition 5 If $\psi$ is a rejector designed for the $n$ classes, $W_1, \ldots, W_n$, we define the
effectiveness of $\psi$ by:

$$\mathrm{Eff}(\psi) = \frac{E_{x \in S}(|\psi(x)|)}{n} \qquad (4)$$

Note that a small numeric value of $\mathrm{Eff}(\psi)$ corresponds to an “effective” rejector. Then,
equation (3) shows that a rejection-based classifier will be efficient when: (a) rejection is
efficient, and (b) rejection is effective.
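
As a small worked example of Definition 5, effectiveness can be estimated empirically by averaging the output cardinality over sample inputs. The four unit-interval classes and the two rejectors below are hypothetical, and the sample average stands in for the expectation over the input distribution.

```python
def effectiveness(rejector, samples, n_classes):
    """Empirical estimate of Eff(psi) = E[|psi(x)|] / n over the samples."""
    mean_cardinality = sum(len(rejector(x)) for x in samples) / len(samples)
    return mean_cardinality / n_classes


# Hypothetical rejectors for n = 4 classes occupying the intervals
# [0, 1), [1, 2), [2, 3) and [3, 4) of the real line.
def coarse(x):
    """One comparison halves the set of candidate classes."""
    return {1, 2} if x < 2.0 else {3, 4}


def trivial(x):
    """A valid rejector by Definition 2, but one that rejects nothing."""
    return {1, 2, 3, 4}
```

On one sample per class, `coarse` has effectiveness 0.5 and `trivial` has effectiveness 1.0, which matches the convention that smaller values indicate a more effective rejector.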

3.4 Composite Rejectors


As we will see, constructing very efficient rejectors is straightforward. However, in some
applications, these rejectors tend to be less effective than might be hoped for. Although a
rejector may eliminate a large percentage of the classes, on average a substantial number
may also be left as possibilities. However, since the output of a rejector is a subset of
classes (which is simply a smaller instance of the original classification problem), we may
recursively apply another rejector. If the new rejector is specifically designed for the reduced
subset of classes, it may well be able to eliminate some classes which the original rejector
was unable to. The result of such a combination of rejectors is a significant improvement
in the overall effectiveness. This is the notion of a composite rejector:

Definition 6 A composite rejector, $T$, is a collection of rejectors, $T = \{\psi_j : j \in \mathcal{J}\}$,
where $\mathcal{J}$ is an index set for $T$, such that:

(a) Each rejector in $T$ is designed to be applied to some subset of $\{W_1, \ldots, W_n\}$.

(b) There is a rejector in $T$ designed for the complete set of classes, $\{W_1, \ldots, W_n\}$.

(c) For any rejector, $\psi_j \in T$, and any $x \in S$, either $|\psi_j(x)| \le 1$ or there is a rejector
in $T$ designed for the subset of classes, $\{W_i : i \in \psi_j(x)\}$.

As indicated above, a composite rejector is applied by first applying the rejector
designed for the complete set of classes. This yields a subset of the class labels and a
reduced instance of the classification problem. By requirement (c) in the above definition,
the composite rejector contains a rejector designed for the reduced set of classes. Hence,
we can repeatedly apply rejectors in this manner, systematically reducing the number of
classes at each step, until we are left with only one class. Alternatively, we may terminate
if the rejector fails to result in a reduction in the number of remaining classes (and there
are no other⁶ unapplied rejectors in $T$ designed for the current set of classes).

The composite rejector is laid out in the form of a directed acyclic graph. Each
rejector, $\psi_j \in T$, and the subset of classes for which it was designed, corresponds to a
node in the graph. There is a directed edge from the node corresponding to $\psi_i$ to that
corresponding to $\psi_j$, if and only if there is a measurement vector, $x \in S$, such that $\psi_j$
was designed for the subset of classes, $\{W_k : k \in \psi_i(x)\}$. (If $\psi_i$ and $\psi_j$ are designed for
the same set of classes, to preserve acyclicity, we only include this edge if $i < j$.) Hence,
the application of the composite rejector to any measurement vector corresponds to a path
through the directed acyclic graph. At each node in the path, the associated rejector
is applied and its output determines the edge that should be taken to leave the node.
Before we detail the construction of composite rejectors, we analyze their time and space
requirements.

⁶It is entirely possible within our definition that a composite rejector may contain several rejectors
designed  for  the  same  set  of  classes.  This  allows  the  opportunity  to  try  a  number  of  alternative  rejectors, 
increasing  the  chance  that  one  of  them  will  be  successful  in  rejecting  some  of  the  classes.  Using  multiple 
rejectors  in  this  way  is  especially  useful  in  cases  where  there  is  just  one  class  to  be  recognized,  for  example 
in  feature  detection. 
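
The application of a composite rejector as a path through the graph can be sketched as follows. This is a minimal illustration under stated assumptions: the four classes are taken to be the intervals [0,1], [2,3], [4,5] and [6,7], each simple rejector is a single threshold test, and nodes are stored in a dictionary keyed by the subset of labels they were designed for. The resulting graph happens to be a tree, which is a special case of a directed acyclic graph.

```python
# Sketch of applying a composite rejector. Each node is keyed by the subset
# of class labels it was designed for. Classes and thresholds are assumptions.
F = frozenset


def threshold_rejector(t, low, high):
    """Simple rejector: one comparison selects one of two label subsets."""
    return lambda x: low if x < t else high


COMPOSITE = {
    F({1, 2, 3, 4}): threshold_rejector(3.5, F({1, 2}), F({3, 4})),
    F({1, 2}): threshold_rejector(1.5, F({1}), F({2})),
    F({3, 4}): threshold_rejector(5.5, F({3}), F({4})),
}


def apply_composite(x, root=F({1, 2, 3, 4})):
    """Follow one path through the graph; return (labels left, path taken)."""
    current, path = root, [root]
    # Requirement (c): continue while more than one label remains and a
    # rejector designed for the current subset exists.
    while len(current) > 1 and current in COMPOSITE:
        remaining = COMPOSITE[current](x)
        if remaining == current:  # no class was eliminated, so terminate
            break
        current = remaining
        path.append(current)
    return current, path
```

For an input such as 4.2, the path visits the node for all four classes, then the node for classes 3 and 4, and ends with the single label 3 after two simple rejectors have been applied.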

3.5 Time Analysis


Intuitively,  the  motivation  for  introducing  a  composite  rejector  is  that  designing  a  rejector 
for  a  reduced  set  of  classes  should  be  easier  and  enable  the  rejection  of  classes  not  previously 
eliminated.  Hence,  we  expect  that  the  composite  rejector  will  be  far  more  effective  than 
any  of  its  individual  constituent  rejectors,  at  the  cost  of  a  slight  reduction  in  efficiency. 
The  recursive  structure  of  the  composite  rejector  leads  us  to  expect  the  complexity  of  the 
resulting rejection-based classifier to be logarithmic in the number of classes. Sufficient
conditions  to  prove  such  a  result  are  as  follows: 

1. For all $\psi_j \in T$ and $x \in S$, either $|\psi_j(x)| \le 1$, or at least one class is eliminated by $\psi_j$.
2. With respect to the underlying a priori probability density function from which the
measurement vectors are drawn, the events, $i \in \psi_j(x)$, are mutually independent.

3. The effectiveness of all the component rejectors is the same: $\forall j \in \mathcal{J}$, $\mathrm{Eff}(\psi_j) = E$, say.

Then, a composite rejector truncated to apply at most $k$ simple rejectors has an
effectiveness bounded above by $\frac{1}{n} + E^k$. This follows from the fact that condition 1 given above
implies that either we have at most one class left, or we have another rejector to apply.
The first case is covered by the term $\frac{1}{n}$, and so we can assume the second case applies.
Conditions 2 and 3 ensure that, for each subsequent rejector, the effectiveness is reduced
by a factor of $E$, hence the term $E^k$. Setting $k = \lceil \log_{E^{-1}} n \rceil$ gives $E^k \le \frac{1}{n}$, so the
effectiveness is at most $\frac{2}{n}$ and on average at most two classes remain to be verified. Using
the truncated composite rejector in a rejection-based classifier, we therefore have logarithmic
time complexity:

$$T_{av}(\phi^T) \le \lceil \log_{E^{-1}} n \rceil \cdot T_{rej} + 2 \cdot T_{ver} \qquad (5)$$

where $T_{rej}$ is the run time of each of the rejectors in $T$ (assumed constant), and $n$ is the
number of classes.
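
The bound can be evaluated numerically with a short sketch. The function names and the cost values used in the usage note are assumptions for illustration; the formulas are the effectiveness bound and the time bound stated above.

```python
def truncation_depth(n, E):
    """Smallest k with E**k <= 1/n, which equals ceil(log_{1/E} n).

    Repeated multiplication is used instead of a floating-point logarithm.
    For values of E that are not exactly representable, the result can
    still differ by one at exact boundaries.
    """
    if not 0.0 < E < 1.0:
        raise ValueError("E must lie strictly between 0 and 1")
    k, remaining = 0, 1.0
    while remaining * n > 1.0:
        remaining *= E
        k += 1
    return k


def effectiveness_bound(n, E, k):
    """Upper bound 1/n + E**k on the effectiveness after k simple rejectors."""
    return 1.0 / n + E ** k


def time_bound(n, E, t_rej, t_ver):
    """Average time bound: k rejectors plus two verifiers."""
    return truncation_depth(n, E) * t_rej + 2.0 * t_ver
```

For example, with n = 1024 classes and component rejectors that each halve the candidates (E = 0.5), ten simple rejectors suffice, and the effectiveness bound becomes 2/1024, so on average at most two verifiers are needed.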


3.6  Space  Analysis 

A potential problem with the composite rejector is that the number of rejectors within $T$
may become very large, possibly as many as $2^n$, the number⁷ of subsets of $\{W_1, \ldots, W_n\}$. To
avoid such exponential growth in the space and preprocessing requirements of the composite
rejector, we must impose further constraints on each rejector, $\psi_j \in T$. We require that:

1. For each simple rejector, the number of different possible output subsets is two.

2. The two possible output subsets of classes are of equal cardinality.

3. The intersection between the two outputs consists of at most a fraction, $\epsilon \in [0, 1)$, of
the number of classes for which the rejector was constructed.

^7 There may possibly be more if we have several rejectors per subset of classes. So, in what follows, we
will assume that we construct at most one rejector for any given subset of classes. For similar reasons, we
also assume that we only construct rejectors if they can actually be reached from the initial rejector, that
is, the one for the complete set of classes.

Then, if we denote by $M(n)$ the maximum number of rejectors in $\Psi$ that may be reached
after, and including, a rejector constructed for a collection of $n$ classes, we have:

$$M(n) \le 1 + 2 \cdot M((1 + \epsilon) \cdot n/2) \qquad (6)$$

By induction on $n$, it can be proven that $M(n)$ is polynomial in $n$:

$$M(n) \le n^{\frac{1}{1 - \log_2(1 + \epsilon)}} - 1 \qquad (7)$$

For $\epsilon = 0$, the bound is $M(n) \le n - 1$, and for $\epsilon = \sqrt{2} - 1 \approx 0.41$, it is $M(n) \le n^2 - 1$.
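
The induction step behind this polynomial bound can be written out as follows. This is our own reconstruction of the argument, assuming the bound $M(m) \le m^p - 1$ already holds for every collection of fewer than $n$ classes; the shorthand $p$ for the exponent is introduced here only for convenience.

```latex
\begin{align*}
p &= \frac{1}{1 - \log_2(1+\epsilon)}
  \quad\Longrightarrow\quad
  2\left(\frac{1+\epsilon}{2}\right)^{p} = 1, \\
M(n) &\le 1 + 2\,M\!\left(\frac{(1+\epsilon)\,n}{2}\right)
  \le 1 + 2\left[\left(\frac{(1+\epsilon)\,n}{2}\right)^{p} - 1\right] \\
  &= 2\left(\frac{1+\epsilon}{2}\right)^{p} n^{p} - 1
  = n^{p} - 1 .
\end{align*}
```

Setting $\epsilon = 0$ gives $p = 1$, and $\epsilon = \sqrt{2} - 1$ gives $p = 2$, so the exponent grows smoothly as the permitted overlap between the two outputs increases.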

In  practice,  it  may  not  be  possible  to  completely  satisfy  the  three  requirements 
stated  above.  However,  the  following  three  design  criteria  may  be  used  as  guidelines  while 
implementing each rejector in the composite rejector, $\Psi$: (a) avoid rejectors that produce a
large  number  of  outputs,  (b)  attempt  to  balance  the  output  cardinalities,  and  (c)  minimize 
the  overlap  between  the  outputs. 


4  Construction  of  Composite  Rejectors 

As  explained  in  the  previous  section,  a  composite  rejector  has  a  hierarchical  structure  that 
includes  a  number  of  simple  rejectors  as  components.  We  now  describe  the  general  purpose 
technique  used  to  construct  each  of  the  component  rejectors.  The  composite  rejector  is 
then  formed  by  recursively  building  the  component  rejectors,  starting  with  one  for  the 
complete  set  of  classes.  Depending  upon  the  application,  alternative  methods  of  rejector 
construction  may  be  possible.  If  so,  they  can  easily  be  combined  with  the  following  in  the 
composite  rejector. 


4.1  Notation 

We write the Euclidean inner (dot) product of two vectors, $x, y \in S$, as $\langle x, y \rangle = \sum_{i=1}^{d} x_i y_i$,
where $x_i$, $y_i$ are the coordinates of the vectors $x, y \in S$ respectively. The induced Euclidean
norm we denote by $\|x\| = \langle x, x \rangle^{1/2}$. We assume that the norm of a vector is unimportant
for classification purposes. It is only the direction of the vector that matters. Hence, we
restrict attention^8 to the surface of the unit ball, $B = \{x \in S : \|x\| = 1\}$. We will assume
that both the measurement vectors, and the classes, $W_1, W_2, \ldots, W_n$, have been normalized
and thus lie in $B$. Normalization can be achieved by replacing $x \in S$ with $\frac{x}{\|x\|} \in B$.

^8 The assumption that all the vectors lie in $B$ is not restrictive in the following sense. It is possible to code
the magnitude of a vector, $x \in S = \mathbb{R}^d$, in a vector of unit norm in $\mathbb{R}^{d+1}$. The vector, $x = (x_1, x_2, \ldots, x_d)^T$,
is replaced with $x' = (x_1, x_2, \ldots, x_d, 1)^T$, and then $x'$ is normalized. The magnitude of $x$ is then encoded
in the last coordinate of $x'/\|x'\|$.

Figure 1: An illustration of the class assumption for a low dimensional example, $S = \mathbb{R}^3$. The
subspace, $L_i$, is the 2 dimensional subspace spanned by the vectors, $\{e_1, e_2\}$. Every vector in $W_i$
can be approximated to within an error, $\delta_i$, by the linear combination of $c_i$ and a vector in $L_i$.
The rejection vector, $r$, is a unit vector orthogonal to the subspace $L_i$. The rejection domain,
$R_i^{\psi_r}$, of the derived rejector, $\psi_r$, consists of all points, $x \in B$, which have a projection in the
direction of the rejection vector, $r$, at least $\delta_i$ away from the projection of $c_i$.

4.2  The  Class  Assumption 

Designing a rejector is equivalent to deciding on the rejection domains associated with each
of the classes. Since for correctness we require that $R_i^{\psi} \cap W_i = \emptyset$, the choice of rejection
domains depends heavily on the nature of the underlying classes. Hence, we make the
following assumption, which is illustrated in Figure 1:

Class Assumption: For each class, $W_i$, there exists a vector, $c_i \in S$, a linear subspace,
$L_i \subset S$, and a threshold, $\delta_i > 0$, such that $\forall x \in W_i$, $\mathrm{dist}(x, c_i + L_i) \le \delta_i$. Further we assume
that: (a) $\dim(L_i) \ll d$, and (b) $\delta_i \ll 1$.
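
The distance test in the class assumption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subspace $L_i$ is assumed to be given by an orthonormal basis, and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dist_to_affine_subspace(x, c, basis):
    """Distance from x to c + span(basis); basis is assumed orthonormal."""
    diff = [xi - ci for xi, ci in zip(x, c)]
    residual = list(diff)
    for b in basis:
        coeff = dot(diff, b)
        # Remove the component of (x - c) lying along this basis vector.
        residual = [ri - coeff * bi for ri, bi in zip(residual, b)]
    return math.sqrt(dot(residual, residual))


def satisfies_class_assumption(samples, c, basis, delta):
    """True if every sample lies within delta of c + span(basis)."""
    return all(dist_to_affine_subspace(x, c, basis) <= delta for x in samples)
```

For example, with $c = (0, 0, 1)$ and a one dimensional subspace spanned by $(1, 0, 0)$, the point $(0.5, 0.03, 1)$ is at distance $0.03$ from $c + L$, so it satisfies the assumption for any threshold of at least $0.03$.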

It is therefore assumed that any vector in the class can be approximated to within
a small error, by a linear combination of a fixed vector, $c_i$, and a vector in the subspace,
$L_i$. For the class assumption to be useful, it must be: (a) general enough to apply in a large
number of applications, and (b) restrictive enough to facilitate the construction of rejectors
which are both efficient and effective. In the following sections, we first demonstrate the
generality of the class assumption by showing its relationship with the Karhunen-Loeve
(K-L) expansion. Then, we proceed to show how the class assumption leads to an efficient
and effective general form for a rejector.


4.3 Relationship with the K-L Expansion


The class assumption can be seen to be very general and allows many different forms for
the classes, $W_i$, including, for instance, disconnected multi-cluster distributions. Its true
generality can be demonstrated by noting that it is approximately equivalent to assuming
that the application of the K-L expansion [Fukunaga 90] [Oja 83] results in a compact and
accurate representation of the class, $W_i$. Suppose that $M_i$ is the subspace spanned by the $k$
most important K-L eigenvectors, and $\{\lambda_s : s = 1, \ldots, d\}$ are the decaying K-L eigenvalues.
Then, we have:

$$E_{x \in W_i}\left(\mathrm{dist}(x, E_{x \in W_i}(x) + M_i)^2\right) = \sum_{s=k+1}^{d} \lambda_s \quad (\approx 0) \qquad (8)$$

Using $c_i$ in place of $E_{x \in W_i}(x)$ and $L_i$ in place of $M_i$, we see that the sole difference between
the class assumption and the K-L expansion is one of expected versus maximum value of
the error in the class representation. Hence, the widespread use of the K-L expansion allows
us to argue that the class assumption can be expected to hold extensively.


4.4  Verifying  the  Class  Assumption 

Since the K-L expansion may be computed efficiently (see for example [Chittineni 81] or
[Murakami and Kumar 82]), we use it to validate the class assumption and moreover to
find $L_i$ and $c_i$. For each class, $W_i$, we put $c_i = E_{x \in W_i}(x)$, and take $L_i$ to be the subspace
spanned by the $k$ most important K-L eigenvectors. With these estimates in place, it is
straightforward to check if the maximum representation error, $\max_{x \in W_i} \mathrm{dist}(x, c_i + L_i)$, is
sufficiently small. (A better method of selecting the thresholds is discussed in Appendix A.)

Inherent in the class assumption is a trade-off. If we are prepared to accept the use
of a subspace with higher dimensionality, we can expect to be able to reduce $\delta_i$. Similarly,
if we reduce $k = \dim(L_i)$, we will generally need to increase $\delta_i$. There is an equivalent
trade-off in the Karhunen-Loeve expansion between the compactness and accuracy of the
representation. In our implementation, the value of $k$ is dependent on the particular class,
and is chosen by thresholding the sum of the discarded eigenvalues.
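
One way to read this selection rule is sketched below in Python: pick the smallest $k$ for which the sum of the discarded eigenvalues does not exceed a threshold. The function name and the use of an absolute (rather than relative) threshold are assumptions made for illustration; the report does not specify these details.

```python
def choose_subspace_dimension(eigenvalues, threshold):
    """Smallest k such that the sum of the d - k smallest eigenvalues <= threshold."""
    lam = sorted(eigenvalues, reverse=True)  # decaying K-L eigenvalues
    for k in range(len(lam) + 1):
        # lam[k:] are the eigenvalues discarded when the top k are kept.
        if sum(lam[k:]) <= threshold:
            return k
    return len(lam)
```

For the hypothetical spectrum `[5.0, 2.0, 0.5, 0.25, 0.25]` and a threshold of `1.0`, the rule keeps $k = 2$ eigenvectors, since the three discarded eigenvalues sum to exactly `1.0`. Tightening the threshold increases $k$, which is the trade-off between $\dim(L_i)$ and $\delta_i$ described above.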


4.5  Derivation  of  a  General  Purpose  Rejector 

Given that the class assumption holds, we are now in a position to derive a general form
for a rejector. We begin by defining the notion of a rejection vector, which is illustrated in
Figure 1:

Definition 7 Suppose the class assumption holds for $W_1, \ldots, W_n$. Then a rejection vector
is a unit vector, $r \in B$, for which $r \perp \bigoplus_{i=1}^{n} L_i$ (equivalently, $r$ is orthogonal to $L_i$ for all $i$).

Figure 2: The effect of applying a rejection vector, $r$. The function $x \mapsto \langle r, x \rangle$ maps each class,
$W_i$, onto the interval, $[\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$. So long as the $\delta_i$ are small and the centers of the
intervals, $\langle r, c_i \rangle$, are well separated, most pairs of intervals will not intersect. This makes the
rejection vector a highly effective one. Given a measurement vector, $x \in B$, the output of the
derived rejector, $\psi_r(x)$, consists of all $i$ for which $\langle r, x \rangle$ lies in the interval $[\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$.

If $r$ is a rejection vector, it follows from the class assumption, the Cauchy-Schwarz inequality
and orthogonality, respectively, that:

$$x \in W_i \;\Rightarrow\; \exists\, l_i \in L_i : \|c_i + l_i - x\| \le \delta_i \;\Rightarrow\; |\langle r, c_i + l_i - x \rangle| \le \delta_i \;\Rightarrow\; |\langle r, c_i \rangle - \langle r, x \rangle| \le \delta_i \qquad (9)$$

Equation (9) means that the rejection vector, $r$, projects every measurement vector of
an entire class, $W_i$, onto the subinterval, $[\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$, of the real line. Since
$\delta_i \ll 1$, this interval is almost a point, and so a compact characteristic of the class. More
importantly, if a measurement vector is not projected into the short interval, $[\langle r, c_i \rangle -
\delta_i, \langle r, c_i \rangle + \delta_i]$, the class, $W_i$, can be safely rejected.

So long as the thresholds, $\delta_i$, are small, and the centers of the intervals, $\langle r, c_i \rangle$, are
well spread out, the intervals themselves will not overlap significantly, as illustrated in
Figure 2. Then, we can easily discriminate^9 between the classes based on their projections,
and so define the derived rejector as follows:

Definition 8 Given that the class assumption holds for the classes $W_1, W_2, \ldots, W_n$, and
that $r \in B$ is a rejection vector, we define the derived rejector, $\psi_r$, by:

$$i \in \psi_r(x) \;\Leftrightarrow\; |\langle r, x \rangle - \langle r, c_i \rangle| \le \delta_i \qquad (10)$$

Hence, the derived rejector returns the class labels of any classes, $W_i$, for which the point,
$\langle r, x \rangle$, lies in the interval, $[\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$, that is the class labels of any class from
which the measurement vector might have come. The rejection domain of $\psi_r$ for the class,
$W_i$, is $R_i^{\psi_r} = \{x \in S : |\langle r, x \rangle - \langle r, c_i \rangle| > \delta_i\}$, as illustrated in Figure 1. Equation (9) then
shows that $W_i \cap R_i^{\psi_r} = \emptyset$, and hence $\psi_r$ is well defined as a rejector.

^9 There is no guarantee that we will be able to find a rejection vector that will completely distinguish
between a given pair of classes. For example, if the convex hulls of the classes overlap, their projections
with any rejection vector will intersect. Note, however, that this occurrence need not affect the usefulness
of a derived rejector since the goal of a rejector is to eliminate most of the classes most of the time, as
opposed to complete discrimination all of the time. We are implicitly assuming that in a large collection
of classes, pairs of classes which are difficult to discriminate occur relatively rarely, not that they do not
occur at all. In our object recognition application, this is indeed a very natural assumption.

The derived rejector may be implemented very efficiently as follows. (A slightly modified
method more appropriate for use in a composite rejector is described in Appendix B.)
First, we compute the projection of the measurement vector, $x \in B$, with the rejection
vector, to give $\langle r, x \rangle$. Then, the set of class labels, $i$, for which $\langle r, x \rangle$ lies in the interval,
$[\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$, can be computed with $\lceil \log_2(2n + 1) \rceil$ comparisons and a lookup
table. This is possible because the derived rejector is a piecewise constant function. It
only changes its value at the $2n$ points, $\langle r, c_i \rangle \pm \delta_i$. The constant values on the intervening
segments can easily be precomputed and stored in the lookup table. Finding the segment
in which $\langle r, x \rangle$ lies takes $\lceil \log_2(2n + 1) \rceil$ comparisons using a binary search.
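
The lookup scheme just described can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation (nor the modified method of Appendix B): the class name `DerivedRejector` is hypothetical, and, as a simplification of our own, the table stores one extra entry per breakpoint so that the closed interval endpoints of equation (10) are handled exactly.

```python
from bisect import bisect_left


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


class DerivedRejector:
    """Piecewise constant map from <r, x> to the set of surviving class labels."""

    def __init__(self, r, centers, deltas):
        self.r = r
        self.intervals = [(dot(r, c) - d, dot(r, c) + d)
                          for c, d in zip(centers, deltas)]
        # At most 2n breakpoints <r, c_i> +/- delta_i.
        self.breaks = sorted({e for iv in self.intervals for e in iv})
        # Table layout: entry 2j is the open segment just below breaks[j]
        # (or above the last breakpoint); entry 2j + 1 is the point breaks[j].
        self.table = []
        for j in range(len(self.breaks) + 1):
            self.table.append(self._labels(self._sample(j)))
            if j < len(self.breaks):
                self.table.append(self._labels(self.breaks[j]))

    def _sample(self, j):
        """A representative point inside the j-th open segment."""
        b = self.breaks
        if not b:
            return 0.0
        if j == 0:
            return b[0] - 1.0
        if j == len(b):
            return b[-1] + 1.0
        return 0.5 * (b[j - 1] + b[j])

    def _labels(self, t):
        return frozenset(i for i, (lo, hi) in enumerate(self.intervals)
                         if lo <= t <= hi)

    def __call__(self, x):
        t = dot(self.r, x)                 # one inner product
        j = bisect_left(self.breaks, t)    # binary search over the breakpoints
        on_break = j < len(self.breaks) and self.breaks[j] == t
        return self.table[2 * j + 1 if on_break else 2 * j]
```

At query time the cost is one inner product plus a binary search; all interval tests are done once, during construction. Class labels in this sketch are zero-based indices into `centers`.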

4.6  Choice  of  the  Rejection  Vector 

We have seen that the derived rejector can be applied efficiently. The reason we can
expect it to be effective is because we have quite some freedom in choosing the direction
of the rejection vector, $r$. Thus far, $r$ has only been constrained to lie orthogonally to
$\bigoplus_{i=1}^{n} L_i$. We enforce this constraint immediately by taking each vector used from now on,
and subtracting the component in the space, $\bigoplus_{i=1}^{n} L_i$.

As Figure 2 shows, we should choose the rejection vector to be the one that spreads
out the centers of the intervals, $[\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$, as much as possible. This will
reduce the size of $\psi_r(x)$ and so tend to optimize the effectiveness of the derived rejector.
If variance is used to measure the spread of the centers, the best rejection vector to choose
is the first Karhunen-Loeve eigenvector^10, that is the one with the largest eigenvalue.
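
As a rough illustration of this choice, the Python sketch below estimates the direction of largest variance of the class centers by power iteration. It assumes the centers have already had their component in the class subspaces removed, as described above; power iteration, the fixed iteration count, and the function names are our own choices for a short self-contained example, not the method used in the report.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [a / n for a in v]


def first_kl_eigenvector(centers, iterations=200):
    """Dominant eigenvector of the scatter matrix of the centers."""
    d = len(centers[0])
    mean = [sum(c[j] for c in centers) / len(centers) for j in range(d)]
    centered = [[c[j] - mean[j] for j in range(d)] for c in centers]
    v = normalize([1.0] * d)  # arbitrary starting direction
    for _ in range(iterations):
        # Multiply v by sum_i (c_i - mean)(c_i - mean)^T without forming the matrix.
        w = [0.0] * d
        for c in centered:
            s = dot(c, v)
            w = [wj + s * cj for wj, cj in zip(w, c)]
        v = normalize(w)
    return v
```

The returned vector is determined only up to sign, which does not matter for a rejector because only the separation of the projected centers is used.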

If there is just one class, as is in the feature detection application, the K-L expansion
cannot be applied because there is only one vector, $c_1$. In this situation, we select
the rejection vector uniformly at random in the space orthogonal to $\bigoplus_{i=1}^{n} L_i$. As described
in [Knuth 81], this can be performed by drawing the $d$ coordinates from a normal distribution,
projecting out the component in $\bigoplus_{i=1}^{n} L_i$, and then normalizing to obtain a rejection
vector that lies on the unit sphere, $B$.
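
The random selection procedure can be sketched in Python as below. The sketch assumes that the sum of the class subspaces is supplied as an orthonormal basis; the function name and the `rng` parameter are illustrative.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def random_rejection_vector(d, basis, rng=random):
    """Random unit vector orthogonal to span(basis); basis is assumed orthonormal."""
    # Draw the d coordinates independently from a standard normal distribution.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Project out the component lying in the subspace.
    for b in basis:
        coeff = dot(v, b)
        v = [vi - coeff * bi for vi, bi in zip(v, b)]
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("basis spans the whole space; no rejection vector exists")
    # Normalize so the result lies on the unit sphere.
    return [vi / norm for vi in v]
```

Because an isotropic normal distribution stays isotropic after orthogonal projection, the normalized result is uniformly distributed over the unit vectors of the orthogonal complement.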

^10 This choice of rejection vector may be seen to be closely related to Fisher's discriminant analysis
[Fisher 36] [Duda and Hart 73]. By working in a space orthogonal to $\bigoplus_{i=1}^{n} L_i$, we are limiting the within-
class scatter of each class, $W_i$. Spreading out the points $\langle r, c_i \rangle$ maximizes the between-class scatter. The
important difference, however, is the inherent conservative nature of the derived rejector, which ensures
we never make a wrong choice, and defers difficult decisions to subsequent rejectors which are in a better
position to discriminate between the difficult cases.

Figure 3: The 20 objects used for recognition. We used 72 images of each object, with consecutive
images separated by 5° of pose. The data set is the same as used in [Murase and Nayar 95].

5  Experiments 

The theory of pattern rejection is general and hence should find use in a variety of applications.
Here, our objective is to demonstrate the generality, efficiency, and effectiveness
of composite rejectors. As examples, we have chosen two problems in computational vision,
namely, 3-D object recognition and feature detection. These problems were selected
as they can, under certain assumptions, be cast as classical pattern recognition problems.
Furthermore, both problems often need to be solved with high efficiency.


5.1  3-D  Object  Recognition 

There are several approaches to 3-D object recognition, most of which attempt to match features
in images to 3-D object models [Besl and Jain 85] [Chin and Dyer 86]. Recently, an
alternative approach called appearance matching has gained popularity, where objects are
modeled as collections of 2-D views [Edelman and Weinshall 91] [Poggio and Edelman 90]
[Murase and Nayar 95]. The main advantages and limitations of appearance matching are
described in [Murase and Nayar 95]. Similar view-based recognition techniques have also
been applied to the problem of face recognition [Pentland et al. 94] [Brunelli and Poggio 93]
[Sirovich and Kirby 87] [Turk and Pentland 91].

In our experiments, we use appearance matching simply as an example of the large
class of problems for which efficient rejectors can be constructed. For simplicity we assume
a constrained environment. We require that the object can be segmented from the
background, is not occluded substantially, and appears in unknown pose but in one of a
small number of stable configurations. Also, we assume that the illumination of the environment
remains more or less unchanged. Under these conditions, appearance matching,
as described in [Murase and Nayar 95], reduces object recognition to a classical pattern
recognition problem. We first segment the object, and then scale by resampling the image
so that the larger of the two object dimensions fits a preselected image size. In our implementation,
the image size was 128 × 128 pixels. The scale normalized image is then treated
as a 16,384 dimensional vector by reading pixel values in a raster scan fashion. Finally,
the image vector is intensity normalized to yield a unit vector which is fed as the input
measurement vector to our rejector.

Figure 4: An example rejector for the set of objects, {1, 5, 13, 18, 19}. Following the procedure in
Appendix B, we select 3 buckets, $b_1 = [-1.0, 0.18]$, $b_2 = [0.18, 0.29]$ and $b_3 = [0.29, 1.0]$. If $\langle r, x \rangle$
falls in bucket $b_1$, the rejector returns the set of class labels, {1, 5, 19}, if it falls in $b_2$ the rejector
returns {18}, and if it falls in $b_3$ the rejector returns {13}. Since the use of such a rejector involves
no more than a single dot product with the measurement vector followed by bin assignment,
rejection proves both efficient and effective.

We used 20 objects in our experiments, each corresponding to a class. A single
image of each object is displayed in Figure 3. We assume that each of the objects can
appear in just one stable configuration. Thus, the pose of the object with respect to the
viewer is given by a single rotation parameter. We used 72 images of each object taken at
5° intervals of pose. The images were divided into two sets, each set consisting of 36 images
separated by 10° of pose. One set of images was used as training samples that define the
classes, and the other set was reserved exclusively for testing the composite rejector.

We implemented a composite rejector for the 20 objects using the procedure outlined
in Section 4. As an example, Figure 4 shows one of the constituent rejectors. A representation
of the entire composite rejector is illustrated in Figure 5, part of which is expanded in
Figure 6. As can be seen in Figure 5, every leaf of the composite rejector contains a single
class. Hence, the composite rejector is capable of discriminating between the 20 objects
without ambiguity. It is guaranteed to assign a unique class to any input vector. This may
be regarded as fortunate. The aim of the rejector is simply to eliminate most of the objects,
and we would have regarded the rejector as successful even if each leaf had contained up to
2-3 objects that needed to be disambiguated using verifiers. We applied the rejector to all
72 images of each object, both the training images used for rejector construction as well as
the test images that we set aside. We found that the rejector gave 100% correct response in
both cases. It is worth noting that the composite rejector contains just 30 simple rejectors.
This should be compared with the number of subsets of 20 objects, which turns out to be
over $10^6$.

As seen in Figure 5, the longest path in the composite rejector consists of 10 steps.
Hence, the maximum number of simple rejectors needed to eliminate all but one of the
classes is 10. By assuming that each image in the data set is equally likely to appear,
we calculated the average number of rejectors needed to be just 6.43. In other words,
the average run time of the composite rejector is the time it takes to compute 6.43 inner
products plus the small overhead of walking the path in the directed graph. Since at each
node there are at most 4 possible paths to take, making the decision consists of only two
comparisons. This efficiency compares very favorably with the results obtained by Murase
and Nayar [Murase and Nayar 95] on the same data. Their implementation based on the
Karhunen-Loeve expansion required 20 inner products, followed by a sophisticated search
procedure. If the time to calculate the inner products is the most important component
in the overall time cost, the composite rejector is approximately 3 times more efficient.
Further, given dedicated hardware to compute inner products, the rejector will yield even
better improvement in performance since we require no complex search procedure. In cases
where the composite rejector has leaves with multiple objects, tuned verifiers of the type
used by Murase and Nayar [Murase and Nayar 95] can be used at the leaves to complete
classification.
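
The walk through the composite rejector can be sketched in Python as follows. Only the root node below uses values reported in Figure 4 (the bucket boundaries 0.18 and 0.29 and their output sets); the two lower nodes, the 2-D rejection vectors, and the toy inputs are hypothetical stand-ins chosen so that the example runs, and all function names are our own.

```python
from bisect import bisect_right


def bucket_rejector(r, edges, outputs):
    """Simple rejector: one inner product, then assignment to a bucket."""
    outputs = [frozenset(o) for o in outputs]

    def apply(x):
        t = sum(a * b for a, b in zip(r, x))
        # A value exactly on a boundary goes to the upper bucket in this sketch.
        return outputs[bisect_right(edges, t)]

    return apply


def classify_by_rejection(x, rejectors, all_classes):
    """Apply simple rejectors until one class remains or no rejector exists.

    Returns the surviving class labels and the number of rejectors applied.
    """
    remaining = frozenset(all_classes)
    steps = 0
    while len(remaining) > 1 and remaining in rejectors:
        output = rejectors[remaining](x) & remaining
        steps += 1
        if output == remaining:
            break  # no class eliminated; leave the rest to a verifier
        remaining = output
    return remaining, steps


# Root node from Figure 4; the two lower nodes are hypothetical.
rejectors = {
    frozenset({1, 5, 13, 18, 19}): bucket_rejector(
        [1.0, 0.0], [0.18, 0.29], [{1, 5, 19}, {18}, {13}]),
    frozenset({1, 5, 19}): bucket_rejector([0.0, 1.0], [0.0], [{1}, {5, 19}]),
    frozenset({5, 19}): bucket_rejector([0.0, 1.0], [0.5], [{5}, {19}]),
}
```

Each rejector is keyed by the set of classes it was built for, so the output of one step directly selects the next rejector. The step count returned alongside the result is the quantity averaged over the data set to obtain figures such as the 6.43 inner products quoted above.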

We investigated the growth rate of the number of rejectors required as a function
of the number of classes by considering subgraphs of the composite rejector. The set of
all vertices that can be reached from a given node in the composite rejector can itself be
regarded as a composite rejector, but for a reduced subset of the 20 objects. So, for each
vertex in the graph, we approximated the average number of simple rejectors required for
the composite rejector rooted at that vertex. In Figure 7, the logarithm of the number of
classes for which the rejector is designed is plotted against the average number of rejectors
required. Where there are several composite rejectors for a similar number of classes,
we plot the average over all such cases. We calculated a least squares fit of a straight
line (shown as a solid line) to the data. It is evident that the data validate our previous
theoretical results and in particular equation (5). This equation predicted that the required
number of rejectors would be a logarithmic function of the number of objects.

Using the same image database, we now compare the performance of the composite
rejector against that of Fisher's discriminant analysis [Fisher 36]. Again, we followed the
same test procedure, namely, setting aside half of the data, and using the other half to
construct the classifier. Then, we constructed the Fisher spaces [Duda and Hart 73] of
different dimensions. In Fisher space the classes consist of tight clusters, which we model
as multivariate normal distributions. We computed the mean and covariance matrix of
each of these distributions. Then, each measurement vector was classified by finding its
closest cluster, i.e. the cluster whose mean is closest to the vector. We used both the
Mahalanobis and Euclidean distances. Figure 8 shows the results plotted as a graph of
the percentage of test images correctly classified, against the dimension of the Fisher space
used. The results shown are for the combined performance on the training and test sets.
The classifier performs slightly better on the training set and slightly worse on the set aside
data. However, the difference between the two is always less than 1%.

Figure 5: A representation of the composite rejector. Each interior node denotes a single rejector
and is labeled with the set of objects that it is designed to act on. At each node, only one inner
product and a couple of comparisons need to be performed. Each leaf denotes a possible output
of the composite rejector. It is interesting to note that objects with similar "gross shape" tend
to group together at higher levels of the rejector, and are only separated closer to the leaves. For
example, the three toy cars (objects 3, 6, & 19) are almost indistinguishable until all other objects
are eliminated.

Figure 6: A small part of the composite rejector. As the number of classes is reduced by each
successive simple rejector, subsequent rejectors become more tuned to the set of remaining classes.
The five objects are first reduced to three, then two, and finally just one.

Figure 7: A graph of the number of objects against the average number of simple rejectors
required to completely discriminate between objects. The graph is plotted using a log scale on
the abscissa, implying a logarithmic growth rate in the time complexity.

The Mahalanobis distance gave consistently better results than the Euclidean distance.
However, even for the Mahalanobis measure, classification results are not perfect.
In fact, the highest correct classification rate of 96.6% was attained for dimension 19. This
compares poorly with the perfect classification obtained by the composite rejector, which
uses an average of just 6.43 rejection vectors. The main reason for the rejector's superior
performance is that its hierarchical structure eliminates classes step by step, while the rejector
used at each step is optimal for the classes the step seeks to distinguish between.
As is seen in Figure 6, rejectors closer to the leaves are tuned to a reduced set of classes,
and so are less "distracted" by other classes. In contrast, all dimensions of the Fisher
space simultaneously seek to classify the entire set of classes. As a result, the second and
subsequent dimensions turn out to be suboptimal when compared with the second and
subsequent layers of a composite rejector.

Figure 8: Results of applying Fisher's discriminant analysis to the data set in Figure 3. On
the abscissa we plot the dimension of the Fisher space used, and on the ordinate the percentage
of test images correctly classified. The peak performance is 96.6% correct, and to reach this
19 discriminant vectors are needed. In contrast, the composite rejector gives perfect (100%)
classification with just 6.43 rejection vectors. Hence, by both measures, efficiency and robustness,
the composite rejector outperforms Fisher's discriminant analysis.


5.2  Local  Feature  Detection 

Another important problem in computational vision which can be reduced to pattern recognition
is the detection of local features (edges, lines, corners, etc.) in an image. The decision
of whether the local feature appears at a given pixel in an image is based entirely on the
intensity values in a surrounding window of $d$ pixels. Treating these intensity values as real
numbers, we have the classification space, $S = \mathbb{R}^d$. If we can characterize the class, say,
$W_f$, of intensity vectors which represent the feature, the problem of feature detection is
reduced to deciding whether a measurement vector, $x \in W_f$.

For lack of space, we will concentrate solely on the step edge as the example feature.
The step edge is the simplest and most widely explored feature. Efficient edge
detectors have been proposed; however, the more sophisticated detectors (for instance,
complete implementations of the Canny edge detector [Canny 86] and the Nalwa-Binford
detector [Nalwa and Binford 86]) are less efficient. Furthermore, elaborate detectors are
unavoidable in the case of more complex features, such as lines and corners, as shown
in [Nayar et al. 95]. For such features, there is no obvious equivalent to the gradient or
Laplacian operators that are often used for edges. In short, as a rule of thumb, high feature
complexity and/or high detection accuracy require the use of computationally expensive
detectors. This makes feature detection a prime candidate for the application of rejection
theory.

Figure 9: The ideal model of a step edge. We consider a window of size 7 × 7 around the center
pixel. A straight line at angle $\theta$ to the x-axis separates the window into two constant intensity
regions. When discretized, the pixels which the line passes through are assigned their intensity
levels using an anti-aliasing algorithm that calculates the average pixel intensity.

The major methods of edge detection are categorized in [Nalwa 93] by how they
define the set, $W_f$. Difference operators, such as the Canny edge operator [Canny 86] and
the Marr-Hildreth operator [Marr and Hildreth 80], implicitly define $W_f$ in terms of the
magnitude of the gradient (or the Laplacian) of the underlying image intensity function.
Model matching methods such as [Nalwa and Binford 86] define $W_f$ using an ideal parameterized
model of the edge, which is mapped into the classification space, $S$, by modeling
the imaging process. We follow the model based approach since it gives us an explicit,
rather than implicit, definition of $W_f$.

Figure 10: The edge rejector applied to 3 noisy synthetic images. The top row shows the noisy
images whose pixels the rejector is applied to. The image on the left has added Gaussian noise of
standard deviation 1 grey level, the middle image has noise of 2 grey levels, and the right image
has noise of 4 grey levels. The bottom row shows the output images produced by the edge rejector.
Each output image consists of rejected pixels (marked black) and candidate pixels (marked white)
that could be fed into an elaborate edge detector such as the one described in [Nayar et al. 95].

We used a three-parameter model of a step edge, which is illustrated in Figure 9.
The edge model occupies a window that includes $7 \times 7$ pixels in the image, which leads to a
classification space of dimension $d = 49$. The parameters consist of the two intensity levels,
A and B, on the opposite sides of the edge, and the angle, $\theta$, of the edge. The following
three-step normalization allows us to eliminate both of the intensity parameters, A and B,
without affecting the underlying edge structure:

1. Given a vector $x = (x_1, \ldots, x_{49})^T$, calculate $\bar{x} = \frac{1}{49} \sum_{i=1}^{49} x_i$.

2. Subtract $\bar{x}$ from each coordinate of $x$ to get $x' = (x_1 - \bar{x}, \ldots, x_{49} - \bar{x})^T$.

3. Calculate the norm $\|x'\|$ of $x'$ and return the unit vector $x'/\|x'\|$.

If the input vector is found to conform to the edge model, the parameters A and B may be
recovered using $A, B \approx \bar{x} \pm \|x'\|$, the approximation arising from the fact that the images
are discretely sampled.
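The three-step normalization above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the column-aligned test edge (no anti-aliasing), and the choice to return `None` for a constant window are assumptions made for the example, not details taken from the report.

```python
import math

def normalize_window(x):
    """Three-step normalization of a window vector (49 values for a 7x7 window)."""
    mean = sum(x) / len(x)                          # step 1: mean intensity
    centered = [v - mean for v in x]                # step 2: subtract the mean
    norm = math.sqrt(sum(v * v for v in centered))  # step 3: Euclidean norm
    if norm == 0.0:
        # Assumption: a constant window has no edge structure to normalize.
        return None
    return [v / norm for v in centered]

def step_window(a, b):
    """Hypothetical ideal vertical step edge: 3 columns at a, 4 columns at b."""
    return [a if col < 3 else b for _row in range(7) for col in range(7)]

# Two edges with the same geometry but different intensity levels
# normalize to the same unit vector.
u1 = normalize_window(step_window(10.0, 60.0))
u2 = normalize_window(step_window(100.0, 200.0))
```

Because the mean and the norm absorb the two intensity levels, only the edge geometry (here, the position of the step) survives in the normalized vector.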

We constructed a composite rejector and applied it to a set of synthetic images
and to a real image. The results are displayed in Figures 10 and 11, respectively. The
synthetic images in Figure 10 are of size $256 \times 256$ pixels, and consist of a high intensity
polygon on a low intensity background. The polygon is bounded by 5 line segments at
angles 15.4°, 67.4°, 107.9°, 187.9°, and 322.2° to the horizontal axis. The difference in
intensity across each segment is 50 grey levels. We used three synthetic images to which
we added Gaussian noise. In the first image the added noise had standard deviation 1 grey
level, in the second 2 grey levels, and in the third 4 grey levels.

Figure 11: The edge rejector applied to a real scene. The output on the right consists of a large
number of pixels (marked black) which the rejection algorithm has quickly eliminated from further
consideration and a small number of pixels (marked white) which it has decided are candidates
to be verified by a sophisticated edge detector.

A composite rejector consisting of 6 rejectors was applied to each of the synthetic
images in Figure 10. The output images were passed through a simple relaxation algorithm
to remove a few scattered false positives. Since the rejector terminates as soon as it first
rejects the pixel as not containing an edge, not all 6 rejectors are used at all pixels. In the
least noisy image, an average (computed over the whole image) of 1.61 rejectors was used.
For the two noisier images, averages of 1.82 and 2.34 rejectors were used, respectively.
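The early termination described above, and the resulting average number of rejectors used, can be illustrated with a small sketch. It assumes a simple linear cascade in which each rejector is a hypothetical `(r, center, delta)` triple that keeps a vector only if its projection onto `r` lies within `delta` of `center`; the report's actual composite rejector may be organized differently, and the toy 2-D data are invented for the example.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def run_composite(x, rejectors):
    """Apply rejectors in order and stop at the first rejection.

    Returns (is_candidate, number_of_rejectors_used).
    """
    used = 0
    for r, center, delta in rejectors:
        used += 1
        if abs(dot(r, x) - center) >= delta:
            return False, used      # rejected: later rejectors are never run
    return True, used               # survived every rejector: still a candidate

def average_rejectors_used(vectors, rejectors):
    """Mean number of rejectors evaluated per input vector."""
    counts = [run_composite(x, rejectors)[1] for x in vectors]
    return sum(counts) / len(counts)

# Toy example with two rejectors and four input vectors.
toy_rejectors = [((1.0, 0.0), 1.0, 0.1), ((0.0, 1.0), 0.0, 0.1)]
toy_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.5), (-1.0, 0.0)]
avg_used = average_rejectors_used(toy_vectors, toy_rejectors)
```

In this toy run two vectors are rejected by the first rejector and two reach the second, so the average is 1.5 even though the cascade contains 2 rejectors.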

In Figure 11, we show similar results for a real image of size $393 \times 289$ pixels taken
in  the  laboratory.  We  used  a  composite  rejector  with  11  rejectors,  of  which  an  average  of 
1.81  were  required  at  each  pixel.  Again,  the  output  of  the  rejector  is  shown  after  it  has 
been  passed  through  the  relaxation  algorithm  to  remove  isolated  false  positives. 


6  Discussion 

Our major contribution has been to focus on computational (as opposed to
representational) approaches to general recognition problems, and to introduce a framework, centered
upon the pattern classes, in which the complexity of such problems can be studied. More
specifically, the key results of our work include:

1. We have provided conditions for logarithmic growth in time complexity as a function
of the number of classes, and verified this behavior empirically for an important
application in computational vision. However, further investigation of these conditions
is needed to enhance our understanding of when they can be expected to hold.

2.  We  analyzed  the  growth  in  the  number  of  rejectors  required  to  construct  a  composite 
rejector.  The  key  is  the  number  of  possible  outputs  of  the  rejectors  and  the  amount 
of intersection between them. This growth, rather than the time complexity, may
well turn out to be the limiting factor in the scalability of rejection. A comparison with
the  much  less  conservative  k-d  trees  [Friedman  et  al.  77]  would  probably  throw  light 
on  what  is  essentially  a  time-space  tradeoff. 

3.  The  class  assumption  is  at  the  heart  of  our  technique  for  constructing  rejectors.  As 
expected,  it  holds  for  some  classes  far  more  than  for  others.  Further  study  of  when 
and  why  it  holds  would  be  useful.  Given  the  derived  relationship  between  rejection 
and  the  K-L  expansion,  this  is  equivalent  to  asking  how  well  the  K-L  expansion  can 
be  expected  to  perform.  This  question  was  raised  in  the  context  of  object  recognition 
in  [Murase  and  Nayar  95]  and  still  remains  unanswered  in  that  application. 

4. We have compared pattern rejection with Fisher’s discriminant analysis and
demonstrated rejection to be superior. Although discriminant analysis is formally “optimal,”
its optimality is more with respect to representation and not efficiency. Further, it is
only the first Fisher vector that can really be regarded as optimal. We have shown
that far better accuracy, efficiency, and discriminating power result from the
hierarchical structure of a composite rejector.


A Estimation of the Thresholds

The only property that the thresholds, $\delta_i$, must comply with for the derived rejector, $\psi_r$, to
behave correctly as a rejector is that each class $W_i$ and its rejection domain $R_i^{\psi_r}$ should
be disjoint. To ensure this, we only require:

$x \in W_i \Rightarrow |\langle r, x \rangle - \langle r, c_i \rangle| < \delta_i$   (since then Def'n 8 $\Rightarrow x \notin R_i^{\psi_r}$)   (11)

Further, the smaller we can make $\delta_i$ without compromising the correct behavior of the
rejector, the more effective we can expect $\psi_r$ to be.

The exact details of how to estimate the best value of $\delta_i$ are largely application
dependent, since they depend heavily on the nature of the distribution of the random variable,
$x \mapsto \langle r, x \rangle$, and also on the tolerance to error. One possibility is to select $\delta_i$ based on
measurements of the number of errors made by the implemented rejector. This was the approach
taken in our feature detection experiments, where careful adjustments can be made during
the design and testing of the feature detector. Another method is to assume a general
parameterized form for the distribution and select $\delta_i$ based on the estimated parameters of
the distribution. In the object recognition experiments, it was found empirically (see
Figure 4) that the distributions for all objects reasonably approximated normal distributions.
To be conservative, we chose a confidence level of over 99.9%, and so $\delta_i$ was selected as 3.5
times the estimated standard deviation of the distribution.
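As a rough illustration of the second method, the sketch below estimates a threshold as 3.5 sample standard deviations of the projections and checks the coverage this implies under a normal model. The function names, the use of the sample mean as the interval center, and the sample values are assumptions made for the example.

```python
import math

def estimate_threshold(projections, k=3.5):
    """Return (center, delta), with delta = k times the sample standard deviation.

    Assumption: the interval center is taken to be the sample mean of the
    projections <r, x> observed for one class.
    """
    n = len(projections)
    center = sum(projections) / n
    variance = sum((p - center) ** 2 for p in projections) / (n - 1)
    return center, k * math.sqrt(variance)

def normal_coverage(k):
    """P(|Z| < k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2.0))

# Hypothetical projections of training vectors from a single class.
center, delta = estimate_threshold([0.48, 0.50, 0.52, 0.50])
coverage = normal_coverage(3.5)
```

Under the normal assumption, an interval of half-width 3.5 standard deviations covers slightly more than 99.9% of the distribution, which is consistent with the confidence level quoted above.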


B  Implementation  of  the  Derived  Rejector 

We address a potential problem with the original definition of the derived rejector. We
rewrite the definition here for convenience:

$i \in \psi_r(x) \Leftrightarrow |\langle r, x \rangle - \langle r, c_i \rangle| < \delta_i$   (12)

The intervals, $[\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$, may overlap in complicated ways, leading to a rejector
with a large number of different output sets, possibly as large as 2n. Hence, there is a
danger that this rejector design will lead to a very large composite rejector.

We redesign the rejector by introducing buckets which form a partition of $[-1, 1]$.
The concept of a bucket is illustrated in Figure 12. We divide $[-1, 1]$ into m neighboring
buckets, $b_1, \ldots, b_m$, where $\forall j$, $b_j = [\mathrm{cut}_{j-1}, \mathrm{cut}_j]$. We also require that $\mathrm{cut}_0 = -1$ and
$\mathrm{cut}_m = 1$. Each point $\mathrm{cut}_j$ is referred to as a cut-point. Once we have decided on the
cut-points, and hence the buckets, we associate with each bucket $b_j$ the set of classes $W_i$
for which the bucket intersects the interval $[\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$:

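$\mathrm{classes}(b_j) = \{ i : b_j \cap [\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i] \neq \emptyset \}$.   (13)

It follows from equation (13) that:

$x \in W_i$ and $\langle r, x \rangle \in b_j \Rightarrow i \in \mathrm{classes}(b_j)$   (14)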

Hence, $\psi_r(x) = \mathrm{classes}(b_j)$, where $b_j$ is the unique bucket for which $\langle r, x \rangle \in b_j$, is a valid
redefinition of a derived rejector. Using a binary search and a lookup table, the modified
rejector can be implemented with one inner product and a logarithmic (in the number of
buckets) number of comparisons. The new derived rejector will not be quite as effective
as the original one, but will lead to a much smaller composite rejector. This is a classic
example of the time-space trade-off.
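The bucket lookup can be sketched with Python's standard `bisect` module. The function names, the dictionary of intervals, and the rule that a projection falling exactly on an interior cut-point is assigned to the bucket on its right are assumptions for illustration; the inner product itself is taken as already computed.

```python
from bisect import bisect_right

def build_bucket_table(cuts, intervals):
    """Compute classes(b_j) for each bucket, following equation (13).

    cuts: sorted cut-points with cuts[0] = -1 and cuts[-1] = 1.
    intervals: dict mapping class index i to (lo, hi), the interval
               [<r,c_i> - delta_i, <r,c_i> + delta_i].
    """
    table = []
    for b_lo, b_hi in zip(cuts, cuts[1:]):
        # Two closed intervals intersect when each starts before the other ends.
        table.append({i for i, (lo, hi) in intervals.items()
                      if lo <= b_hi and hi >= b_lo})
    return table

def bucket_rejector(projection, cuts, table):
    """Return the class set of the bucket containing the projection <r, x>."""
    k = bisect_right(cuts, projection) - 1      # binary search over cut-points
    k = min(max(k, 0), len(cuts) - 2)           # clamp so that +1 maps to the last bucket
    return table[k]

# Hypothetical example: three classes and three buckets.
cuts = [-1.0, -0.2, 0.3, 1.0]
intervals = {0: (-0.9, -0.4), 1: (-0.5, 0.1), 2: (0.5, 0.8)}
table = build_bucket_table(cuts, intervals)
```

The table is built once offline; at run time each query costs one binary search over the cut-points plus one table lookup, in addition to the inner product.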

Figure 12: The interval $[-1, 1]$, which corresponds to all possible inner products with a rejection
vector, is partitioned into buckets. Neighboring buckets are separated by cut-points. Each bucket
is associated with a set of classes, namely, those which have a non-empty intersection with the
bucket. We amend the design of the derived rejector to return the set of classes associated with
the bucket into which the measurement vector is projected.

The reason for introducing the notion of a bucket is so that we may carefully select
the cut-points, so as to follow the design guidelines introduced in Section 3.6. We use the
following algorithm that aims to: (a) keep the number of buckets, and hence the number
of outputs, small, (b) balance the sizes of the output subsets, $\mathrm{classes}(b_j)$, and (c) minimize
the intersection between the output subsets:

Algorithm: Choice of the cut-points, $\{\mathrm{cut}_j : j = 0, 1, \ldots, m\}$.

1. Initialize the set $J = \{-1, 1\}$.

2. Put $M = \{\langle r, c_i \rangle - \delta_i : i = 1, 2, \ldots, n\} \cup \{\langle r, c_i \rangle + \delta_i : i = 1, 2, \ldots, n\}$, and $M' = \emptyset$.

3. Sort the set $M$. For each consecutive pair of numbers in $M$, put their mean in $M'$.

4. For each point, $x \in M'$, in turn, insert $x$ into $J$ if and only if:

$\forall i = 1, 2, \ldots, n$,  $x \notin [\langle r, c_i \rangle - \delta_i, \langle r, c_i \rangle + \delta_i]$   (15)

5. Add to $J$ the points which maximize, over all $y \in M'$, the expression:

$\min(|\{i : y < \langle r, c_i \rangle - \delta_i\}|, |\{i : y > \langle r, c_i \rangle + \delta_i\}|)$   (16)

6. Return the set of cut-points, $J$.
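The six steps above can be sketched as follows. This is a literal reading of the algorithm under stated assumptions: the intervals are passed as `(lo, hi)` pairs, all points tying for the maximum of expression (16) are added in step 5, and the function name and example intervals are hypothetical.

```python
def choose_cut_points(intervals):
    """Select cut-points for the bucket partition of [-1, 1].

    intervals: list of (lo, hi) pairs, where lo = <r,c_i> - delta_i and
               hi = <r,c_i> + delta_i for class i.
    """
    cuts = {-1.0, 1.0}                                            # step 1
    endpoints = sorted(e for pair in intervals for e in pair)     # steps 2 and 3
    midpoints = [(a + b) / 2.0 for a, b in zip(endpoints, endpoints[1:])]

    # Step 4: keep midpoints that lie inside none of the class intervals.
    for y in midpoints:
        if all(not (lo <= y <= hi) for lo, hi in intervals):
            cuts.add(y)

    # Step 5: add the midpoints that best balance the classes lying wholly
    # to the right and wholly to the left, as in expression (16).
    def balance(y):
        right = sum(1 for lo, _hi in intervals if y < lo)
        left = sum(1 for _lo, hi in intervals if y > hi)
        return min(right, left)

    if midpoints:
        best = max(balance(y) for y in midpoints)
        cuts.update(y for y in midpoints if balance(y) == best)

    return sorted(cuts)                                           # step 6

# Hypothetical example: two overlapping pairs of intervals with a gap between them.
example_cuts = choose_cut_points([(-0.9, -0.5), (-0.6, -0.1), (0.2, 0.6), (0.5, 0.9)])
```

In this example the only midpoint lying outside every interval is 0.05, which is also the best balanced point, so the result is the three cut-points -1, 0.05, and 1, giving two buckets with two classes each.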